RULE INDUCTION FOR SUMMARIZING
THE CLASSES IN A CLASSIFIED 
DOCUMENT COLLECTION 

DESCRIPTION 

BACKGROUND OF THE INVENTION

Field of the Invention 

The present invention generally relates to a
method and apparatus for providing document
summaries and, more particularly, to a method and
apparatus for providing summaries of documents
belonging to a class in a classified document
collection.

Background Description 

Businesses and institutions generate large
numbers of documents in the course of their commerce
and activities. These documents range from business
proposals and plans to intra-office correspondence
between employees and the like.

The documents of a business or institution
represent a substantial resource for that business
or institution. Thus, in order to store these
documents more effectively, it is not uncommon for
the business or institution to store them digitally
on a magnetic disc or other appropriate media.

One known method for electronically storing the
documents is to first scan the documents, and then
process the scanned images with optical character
recognition software to generate machine-readable
files. The generated files are then compactly
stored on magnetic or optical media. Documents
originally generated by a computer, such as with
word processor, spreadsheet or database software,
can of course be stored directly to magnetic or
optical media.

There is a significant advantage, from a storage
and archival standpoint, to storing documents
electronically, but there remains a problem of
retrieving information from the stored documents.
In the past, retrieval of the documents has been
accomplished by separately preparing an index to
access the documents. To this end, a number of full
text search software products have been developed
which respond to structured queries to search a
document database.

To further assist in searching documents, it is
not uncommon for retrieval systems to prepare
summaries of stored documents so that a user only
has to read through the document summaries in order
to find relevant documents. The use of such summary
retrieval systems greatly reduces the time required
to review the stored documents and thus reduces the
costs associated with the search and review of the
stored documents.

Document summaries can be generated after
document creation either manually or automatically.
Manually created summaries are of high quality, but
are cost prohibitive due to the labor intensive
tasks of manually reading and summarizing the
documents. Automatic summaries, on the other hand,
are less expensive, but current systems do not
produce consistently high quality document
summaries.

A common approach for automatically generating
summaries of individual documents relies upon
either natural language processing or quantitative
content analysis. Natural language processing is
computationally intensive, while quantitative
content analysis relies upon statistical properties
of text to produce summaries. In both cases (i.e.,
natural language processing or quantitative content
analysis), a document is typically processed in
isolation to determine important words, phrases or
terms, and then those words, phrases or terms are
used to provide a summary of that particular
document. Thus, in order to provide summaries for
individual documents, each document is first
separately processed to determine the important
words, phrases or terms therein, and thereafter
further processed to match those important words in
order to provide a summary thereof. As is well
understood by one of ordinary skill in the art,
this type of approach is resource inefficient and
time consuming.

By way of example, U.S. Patent No. 5,689,716 to
Chen discloses an automatic method of generating
thematic summaries of a single document. The Chen
technique begins with determining the number of
thematic terms to be used based upon the number of
thematic sentences to be extracted from the
document. The Chen method then identifies the
thematic terms within the document, and afterward,
each sentence of the document is scored based upon
the number of thematic terms contained within the
sentence. The desired number of highest scoring
sentences are then selected as thematic sentences.
This same process must be repeated for any
additional documents.
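
The sentence-scoring idea described in the preceding
paragraph can be illustrated with a short sketch. It
is a simplification made for illustration only and
is not taken from the Chen patent: the stop list, the
choice of the most frequent non-stop words as
thematic terms, and the names tokenize and
thematic_sentences are all assumptions.

```python
import re
from collections import Counter

# Assumed stop list; the text above does not specify one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def thematic_sentences(document, num_sentences, num_terms):
    """Return the highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Assumption: thematic terms are the most frequent non-stop words.
    counts = Counter(t for t in tokenize(document) if t not in STOP_WORDS)
    terms = {term for term, _ in counts.most_common(num_terms)}
    # Score each sentence by the number of thematic-term occurrences in it.
    scored = [
        (sum(1 for t in tokenize(s) if t in terms), index, s)
        for index, s in enumerate(sentences)
    ]
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:num_sentences]
    return [s for _, _, s in sorted(best, key=lambda item: item[1])]
```

Note that every new document requires the full
pass shown here (term identification followed by
sentence scoring), which is the per-document cost
the background section points to.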

A variant of the Chen method is disclosed in
U.S. Patent No. 5,384,703 to Withgott et al.
Withgott uses regions instead of sentences, and more
specifically, discloses a method and apparatus for
summarizing documents according to theme. Using the
method and apparatus of Withgott, a summary of a
document is formed by selecting regions of the
document, where each selected region includes at
least two members of a seed list. The seed list is
formed from a predetermined number of the most
frequently occurring complex expressions in the
document that are not on a stop list. If the summary
is too long, the region-selection process is
performed on the summary to produce a shorter
summary. This region-selection process is repeated
until a summary of that particular document having
a desired length is produced. Each time the
region-selection process is repeated, the seed list
members are added to the stop list and the
complexity level used to identify frequently
occurring expressions is reduced. As with Chen,
this same process must be repeated for any
additional documents.

An approach used for providing a single summary
for an entire collection of documents is disclosed
in "Generating Natural Language Summaries from
Multiple On-Line Sources", Dragomir Radev et al.,
Computational Linguistics, vol. 99, Nov. 9, 1998.
In the Radev approach, linguistic analysis of a
document collection includes filling predefined
templates or information structures, and then using
natural language generation techniques to provide a
readable version of the formatted template.

Accordingly, what is needed is a method and
system capable of providing a summary of
individual documents without having to perform a
resource intensive process on each individual
document. What is further needed is a method and
system capable of providing a summary of
more than one document belonging to a class in a
classified document collection.

SUMMARY OF THE INVENTION

The present invention is directed to a method
and apparatus for providing summaries of documents
belonging to a class of documents in a classified
document collection. In embodiments of the present
invention, a sample set of documents belonging to
one or more classes is processed via a machine
learning system in order to induce a set of rules
associated with the sample set of documents.

In order to induce the rules associated with
the sample set of documents, any known machine
learning system may be implemented by the apparatus
of the present invention, such as, for example, (i)
a rule based engine, (ii) a decision tree system,
(iii) a multiplicative update based algorithm engine
or (iv) any other well known machine learning system
from which rules can be derived. By way of one
example, the machine learning system trains on the
set of sample pre-classified documents by (i)
preparing the sample set of documents, (ii) training
on the sample set of documents and (iii) testing on
a set of the pre-classified documents.

Once the rules or set of rules are induced, the
set of rules may then be extracted (e.g., decomposed
to provide a concise description of the class) and
used to provide summaries for individual documents
belonging to the same class of documents as the
sample documents. More specifically, the words,
phrases or terms of each incoming document may be
matched to the extracted rules or set of rules
associated with the sample documents in order to
provide a summary of each of the incoming documents.

It is contemplated that in addition to
providing summaries for each of the incoming
documents, a header or other identifying feature of
the incoming document may be provided with the
summary of the incoming document. This allows the
user to easily determine which summary belongs to
which summarized document. In further embodiments,
the summary of each document may be provided with an
"address", or may equally be provided with a
hyperlink to the incoming document.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and
advantages will be better understood from the
following detailed description of a preferred
embodiment of the invention with reference to the
drawings, in which:

Figure 1 is a general layout of a flow diagram
for processing of sample data marked as belonging to
one or more classes;

Figure 2 is a general layout of the present
invention; and

Figure 3 is a flow diagram showing the steps
needed to implement the method of the present
invention.

DETAILED DESCRIPTION OF A PREFERRED
EMBODIMENT OF THE INVENTION

The present invention is directed to a method
and apparatus for providing document summaries.
More specifically, the present invention is directed
to a method and apparatus for providing summaries of
incoming documents belonging to a class of documents
in a classified document collection by, in
embodiments, (i) processing a sample set of
documents in order to induce a set of rules (e.g.,
a vocabulary of words or attributes of the sample
documents) which provide a characterization of each
class in the collection, (ii) comparing words,
phrases, terms and the like extracted from the set
of induced rules to each individual incoming
document and (iii) providing a summary of the
incoming document based on any matches between the
extracted rules induced from the sample set of
documents and words, terms or phrases of the
incoming document.

More specifically, and in order to
accomplish the objectives of the present invention,
a sample set of documents (e.g., data) is provided
to a machine learning system in order to induce a
vocabulary of rules or a set of rules (e.g.,
attributes of the sample documents) associated with
the sample set of documents. Typically, the sample
set of documents belongs to one or more classes.
The rules or set of rules associated with the sample
set of documents may then be extracted, and the
method and apparatus of the present invention
compares these extracted rules to an incoming stream
of documents in order to provide a summary for each
of these incoming documents. It is well understood
by one of ordinary skill in the art of rule based
learning systems that "extracted" refers to the
decomposition of a rule to provide a concise
description of the class of the sample set of
documents such as, for example, words, phrases,
terms and the like.

In embodiments, prior to comparing the rules
with the incoming documents, the incoming documents
may be refined by, for example, (i) eliminating or
combining similarly defined words (e.g., synonyms)
within the document, (ii) reducing words to their
stems by removing suffixes (e.g., "ing", "s", "ed",
etc.) or (iii) any of numerous other modifications
of the incoming document. This provides a more
concise description of the incoming documents,
thereby increasing the efficiency of the method and
apparatus of the present invention.
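
A minimal sketch of such a refinement step is given
below. The synonym table, the suffix list, the
minimum stem length and the function names are
assumptions made for illustration; a practical
system would more likely use an established stemmer
and thesaurus.

```python
import re

# Hypothetical synonym table and suffix list, chosen only for illustration.
SYNONYMS = {"house": "home", "residence": "home", "buy": "purchase"}
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize the text, strip suffixes, then merge synonyms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    refined = []
    for token in tokens:
        stem = strip_suffix(token)
        refined.append(SYNONYMS.get(stem, stem))
    return refined
```

With these assumed tables, refine("Buying houses")
reduces the two words to "purchase" and "home", so
that a later comparison against the extracted rules
has fewer surface forms to consider.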

Thus, the approach of the present invention
avoids the need for (i) first individually
processing a single document in order to find
important words, phrases, etc., (ii) processing the
document a second time in order to match the
important words, phrases, etc. within the document,
(iii) providing a summary for that individual
document based on the matched words, phrases, etc.,
and (iv) repeating steps (i)-(iii) for each further
document. Instead, the present invention is capable
of first defining a set of rules based on a sample
set of documents, and using the words, phrases,
terms and the like extracted from the rules to
provide summaries for a stream of incoming
documents. Thus, the apparatus and method of the
present invention provide an efficient and cost
effective means for providing summaries of
documents.

It is well understood that the apparatus and
method of the present invention can be implemented
using a plurality of separate dedicated or
programmable integrated or other electronic circuits
or devices (e.g., hardwired electronic or logic
circuits such as discrete element circuits, or
programmable logic devices such as PLDs, PLAs, PALs,
or the like). A suitably programmed general purpose
computer, e.g., a microprocessor, microcontroller or
other processor device (CPU or MPU), either alone or
in conjunction with one or more peripheral (e.g.,
integrated circuit) data and signal processing
devices, can be used to implement the invention. In
general, any device or assembly of devices on which
resides a finite state machine capable of
implementing the flow charts shown in the figures
can be used as a controller with the invention.

Processing a Sample Set of Data

Referring now to the drawings, and more
particularly to Figure 1, there is shown a general
layout of a flow diagram for processing sample
documents marked as belonging to one or more
classes. It should be understood that Figure 1 may
equally represent a high level block diagram showing
an apparatus for processing a set of sample
documents. It is further well understood that the
specific processing of sample documents of Figure 1
is but one example of processing sample documents,
and that any well known method of processing sample
documents in accordance with the present invention
is contemplated for use herein. Thus, the specific
example of Figure 1 is not critical to the
understanding of the present invention and is used
merely as one illustration of processing sample
documents in order to provide rules therefrom.

Still referring to Figure 1, in step S10,
sample input documents are marked as belonging to
one or more classes (e.g., pre-classified sample
input documents). In step S20, the sample input
documents are provided to a machine learning system
which trains on the sample input documents in order
to induce a set of rules associated with the one or
more classes, as discussed in more detail below.

In step S30, the rules for characterizing each
class are provided. These rules may be, for example,
a vocabulary of words that are characteristic of the
sample input documents in the one or more classes or
other attributes associated with the sample input
documents. In embodiments, the rules may be further
refined by any number of processes, such as, for
example, morphological analysis, stemming,
tokenization and the like. In step S40, the set of
rules is extracted (e.g., decomposed to provide a
concise description of the class such as, for
example, words, phrases, terms and the like) for use
with the method and apparatus of the present
invention, as will be described in greater detail
with reference to Figures 2 and 3.
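
One way to picture the extraction of step S40 is
sketched below. The rule representation (each rule a
tuple of terms that must all be present), the class
labels, the term "mortgage" and the function name
extract_rules are hypothetical and serve only to
make the decomposition concrete.

```python
# Hypothetical induced rules: each rule is a conjunction of terms which,
# when all are present in a document, indicates the class.
INDUCED_RULES = {
    "home_purchase": [("purchase price", "mortgage"), ("state of residence",)],
    "travel": [("flight", "hotel")],
}


def extract_rules(induced_rules):
    """Decompose each class's rules into a flat, duplicate-free term list."""
    extracted = {}
    for label, rules in induced_rules.items():
        terms = []
        for rule in rules:
            for term in rule:
                if term not in terms:
                    terms.append(term)
        extracted[label] = terms
    return extracted
```

Under these assumptions, the extracted description of
a class is simply the ordered list of words and
phrases appearing in its rules, which is the form
consumed by the comparison described with reference
to Figures 2 and 3.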

Machine Learning System

The machine learning system of step S20 may be,
for example, (i) a rule based engine, (ii) a decision
tree system, (iii) a multiplicative update based
algorithm engine or (iv) any other well known
machine learning system from which rules can be
derived. The machine learning system is used to
provide the rules associated with the sample input
documents of one or more classes.

By way of one example, the machine learning
system, in step S20, trains on the set of sample
pre-classified documents by (i) preparing the sample
set of documents, (ii) training on the sample set of
documents and (iii) testing on a subset of the
pre-classified documents. As discussed in co-pending
U.S. patent application no. 09/176,322, incorporated
herein by reference in its entirety, data
preparation typically involves obtaining a corpus of
pre-classified data, and training involves training a
classifier (e.g., machine learning system) on a
corpus of pre-classified documents. Testing
includes testing the machine learning system with
some subset of the pre-classified documents set
aside for this purpose. The process of generating
training vectors of the present invention may be
divided into three steps, which are strictly
illustrative of one example contemplated for use
with the present invention. Accordingly, other
known processes for generating training vectors can
work equally well with the present invention. The
following is provided as one example of generating
training vectors:

1. Feature definition: Typically this
involves breaking the text up into tokens. Tokens
can then be reduced to their stems or combined into
multi-word terms.

2. Feature count: Typically this involves
counting the frequencies of tokens in the input
texts. Tokens can be counted by their absolute
frequency, or by several relative frequencies
(relativized to the document length, the most
frequent token, square root, etc.).

3. Feature selection: This step includes
weighting features (e.g., depending on the part of
the input text they occur in: title vs. body) and
filtering features depending on how distinctive they
are for texts of a certain class (filtering can be
done by stop word list, based on in-class vs.
out-class frequency, etc.).
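
The three steps above can be sketched as follows.
This is an illustrative simplification and not the
procedure of the co-pending application: the stop
list, the use of document length as the only
relativization, the margin-based filter and all
function names are assumptions.

```python
import re
from collections import Counter

# Assumed stop word list used for filtering.
STOP_WORDS = {"a", "and", "in", "is", "of", "the"}


def define_features(text):
    """Step 1: break the text into lowercase tokens, dropping stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def count_features(tokens):
    """Step 2: frequency of each token relative to the document length."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {token: n / total for token, n in counts.items()}


def select_features(labeled_docs, target_class, min_margin=0.1):
    """Step 3: keep tokens markedly more frequent in-class than out-class."""
    in_class, out_class = Counter(), Counter()
    for label, text in labeled_docs:
        bucket = in_class if label == target_class else out_class
        bucket.update(define_features(text))
    in_total = sum(in_class.values()) or 1
    out_total = sum(out_class.values()) or 1
    selected = [
        token
        for token, n in in_class.items()
        if n / in_total - out_class[token] / out_total >= min_margin
    ]
    return sorted(selected)
```

In this sketch a token such as "price" that is
common both inside and outside the target class is
filtered out, while tokens concentrated in the
target class survive as candidate features.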

It is well understood that the above method for
providing rules is one example contemplated for use
with the present invention. Accordingly, other well
known methods, including manually inducing rules
based on the set of sample input documents, are
contemplated for use with the present invention.
Thus, the present invention should not be limited in
any way to the above illustrated method of obtaining
rules for a set of sample documents belonging to one
or more classes.

The System of the Present Invention 

Once the rules are induced and extracted, words
or phrases of each incoming document can then be
matched to the extracted rules associated with the
sample documents in order to provide a summary of
each of the incoming documents. That is, the
extracted words, terms or phrases of the rules are
matched to the words, terms or phrases of the
incoming document in order to provide a summary
thereof. Thus, by using the apparatus and method of
the present invention, each incoming document does
not have to first be processed to find the most
important words and the like in order to provide a
summary of the incoming document, but may instead be
compared to rules of sample documents belonging to
the same class as each input document.

Referring now to Figure 2, a block diagram of
the present invention is shown. More specifically,
an input module 50 and a rule module 60 are
provided. The input module 50 provides one or more
input documents to a comparer module 70, while the
rule module 60 provides the rules induced from the
sample set of documents (hereinafter referred to as
extracted rules, i.e., words, terms or phrases
extracted from the vocabulary of the induced rules)
to the comparer module 70. It is well understood
that the extracted rules may be associated with one
or more classes of sample input documents and may be
used to provide a summary for each incoming document
belonging to the same one or more classes.

The comparer module 70 compares each of the
incoming documents to the extracted rules and
determines whether there are any matches between the
extracted rules and the words (or terms or phrases)
of each of the incoming documents. The matches
between the extracted rules and any words (or terms
or phrases) of each of the incoming documents are
presented by the display module 80 as a summary for
each of the incoming documents.
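
The comparison performed by the comparer module 70
might be sketched as below. Whole-word,
case-insensitive matching is an assumption, as is
the function name compare; the text does not
prescribe a particular matching technique.

```python
import re


def compare(document_text, extracted_terms):
    """Return the extracted terms that occur in the document as whole words."""
    text = document_text.lower()
    matches = []
    for term in extracted_terms:
        # re.escape keeps any punctuation in a term from acting as regex syntax.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            matches.append(term)
    return matches
```

The returned list plays the role of the summary
handed to the display module 80. An empty list
corresponds to the case in which no summary is
presented.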

It is contemplated by the present invention 
that in addition to a summary provided for each of 
the incoming documents, a header or other 
identifying feature associated with each of the 
incoming documents may be provided with the summary 
of each of the incoming documents. This allows the 
user to easily determine which summary belongs to 
which incoming document. It is important to note 
that the present invention is not limited to any 
specific identifying features and may include, for 
example, the title of the incoming document, the
author of the incoming document, the date the
incoming document was created or any other known 
identifying feature of the incoming document. 

In still further embodiments of the present
invention, the display may show an "address"
(associated with the incoming documents), or may
equally provide a hyperlink to the incoming
documents. Other methods of retrieving the incoming
documents (associated with the displayed
summaries) are also contemplated for use with the
present invention.
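
By way of illustration only, a summary paired with an
identifying feature and an address could be
represented as shown below. The record layout, the
field names and the sample address are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SummaryRecord:
    """A summary together with features identifying its source document."""
    title: str
    matches: List[str]
    address: Optional[str] = None  # e.g., a file path or hyperlink


def render(record):
    """Format the record for display, one labeled field per line."""
    lines = [
        f"Document: {record.title}",
        "Summary: " + ", ".join(record.matches),
    ]
    if record.address:
        lines.append(f"Address: {record.address}")
    return "\n".join(lines)
```

Keeping the title and address alongside the matched
terms lets the user see which document a summary
belongs to and retrieve that document if desired.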

Method of Use of the Present Invention 

Figure 3 shows a flow diagram depicting the
steps of implementing the method of the present
invention. In step S100, a user inputs one or more
input documents into the apparatus of the present
invention. The input documents may be provided via
the Internet, an intranet, LAN or other similar
systems. In embodiments, in step S105, each of the
incoming documents may be refined in order to
provide a more concise description of the incoming
document. Such processing may include (i) stemming,
(ii) tokenization or (iii) any other well known text
processing techniques. In step S110, the extracted
set of rules for each class obtained via steps
S10-S40 of Figure 1 is input into the apparatus of
the present invention. As with the incoming
documents, the extracted set of rules may be
provided via the Internet, an intranet, LAN or other
similar systems.

Still referring to Figure 3, in step S120, a
determination is made as to whether there are any
matches between the words, phrases or terms of
each individual incoming document and the extracted
rules of the sample documents. If there are no
matches, then no summaries are presented and, in
step S130, a determination is made as to whether
there are any further documents to be input to the
apparatus of the present invention. If there are no
further documents, then the method of the present
invention ends in step S140; however, if there are
further document(s), then those additional documents
are input in step S100.

If there are matches between the words,
phrases or terms of an individual incoming document
and the extracted rules of the sample documents,
then the matching words of the incoming document are
presented in step S150 and, in a preferred
embodiment, serve as a summary for each document
within a class. It is well understood that steps
S105, S120 and S150 are equally used with all
incoming documents, and may be performed on one
document at a time or on any combination of incoming
documents depending on the particular desires of the
user of the method and apparatus of the present
invention.

In view of the above discussion, it is now well
understood that the presented matching words are
used as the summaries of each of the incoming
documents.

In step S160, the documents associated with the
summaries may be provided. Step S160 may be
performed by furnishing a document address or
hyperlink, for example.
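
The flow of Figure 3 can be approximated by the
following sketch, with the step numbers noted in
comments. The input format (title, address, text),
the simple refinement and the function names are
assumptions for illustration and are not the only
way to realize the steps.

```python
import re


def refine(text):
    """S105: lowercase the text and keep only word-like tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def summarize_stream(documents, extracted_rules):
    """Summarize each (title, address, text) document against the rules."""
    results = []
    for title, address, text in documents:  # S100, repeated via S130
        refined = refine(text)  # S105
        matches = [  # S120: look for rule terms in the document
            term
            for term in extracted_rules  # S110: rules supplied by the caller
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", refined)
        ]
        if matches:  # S150: matching terms become the summary
            results.append(
                {"title": title, "summary": matches, "address": address}  # S160
            )
    return results  # S140: no further documents
```

Documents with no matching terms produce no entry,
mirroring the branch in which no summary is
presented and the method proceeds to the next
document.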

Example of Use of The Present Invention 

Provided herein is one illustrative example of
providing a summary for individual documents in one
or more classes using the method and apparatus of
the present invention. It should be understood that
the following example does not in any manner
whatsoever limit the scope of the present invention,
and it should further be realized that there are
many further examples that may equally be used with
the present invention.

By way of example, a set of documents relating
to "purchasing a new home" (e.g., a class of
documents) is provided to a machine learning system.
The machine learning system trains on the set of
documents in order to obtain a set of rules, such
as, for example, "purchase price" and "state of
residence". The rules are then extracted and
compared to incoming documents in order to provide a
summary of the incoming documents.

In the present example, the first incoming 
document includes words that are matched with the 
extracted set of rules obtained from the sample set 
of documents; that is, for example, "a purchase 
price of a house in Armonk, New York is $150,000". 
The apparatus of the present invention then displays
this information as a summary of the incoming
document. The apparatus of the present invention
may further provide a header which identifies the 
summary as belonging to the document titled "Home 
Costs Are on the Rise". This same method can then 
be used for a second, third, etc. incoming document 
in order to provide a summary of the second, third, 
etc. incoming document. 

As seen, by using the apparatus and method of 
the present invention, individual summaries of 
incoming documents can be obtained easily and cost 
efficiently. 

While the invention has been described in terms 
of preferred embodiments, those skilled in the art 
will recognize that the invention can be practiced 
with modification within the spirit and scope of the 
appended claims.

CLAIMS

Having thus described our invention, what we 
claim as new and desire to secure by Letters Patent 
is as follows: 

1. A method of providing summaries of documents
belonging to a class of a classified document
collection comprising:

inducing a set of rules from a sample set of
documents, the set of rules being characteristic of
the sample input documents;

comparing extracted words, phrases, terms and
the like appearing in the set of rules induced from
the sample set of documents to an individual
incoming input document; and

providing a summary of the individual incoming
input document based on matches between the
extracted words, phrases, terms and the like and the
individual incoming input document.

2. The method of claim 1, further comprising:

providing more than one individual incoming
input document; and

comparing the more than one individual incoming
input document with the extracted words, phrases,
terms and the like appearing in the set of rules
induced from the sample set of documents in order to
provide summaries of at least one of the more than
one individual incoming input document.

3. The method of claim 2, wherein the more than
one individual incoming input document is at least
two individual incoming input documents.

4. The method of claim 3, wherein:

the sample set of documents belong to one or
more classes of a classified document collection;

the at least two individual incoming input
documents belong to the one or more classes of the
classified document collection; and

the comparing step compares the at least two
individual incoming input documents with the
extracted words, phrases, terms and the like
appearing in the set of rules in order to provide a
summary for each of the at least two individual
incoming input documents,

wherein the comparing step compares a same
class of the at least two individual incoming input
documents and the sample set of documents.

5. The method of claim 1, wherein:

the sample set of documents belong to one or
more classes of a classified document collection;
and

the individual incoming input document belongs
to one of the one or more classes of the classified
document collection.

6. The method of claim 5, wherein the comparing
step compares a same class of the individual
incoming input document and the sample set of
documents in order to provide a summary for the
individual incoming input document.

7. The method of claim 1, wherein:

the comparing step compares the extracted 
words, phrases, terms and the like appearing in the 
set of rules to words or phrases or terms in the 
individual incoming input document; and 

the summary of the individual incoming input 
document includes word or term or phrase matches 
between the individual incoming input document and 
the extracted words, phrases, terms and the like 
appearing in the set of rules. 

8. The method of claim 7, further comprising 
refining the individual incoming input document 
prior to the comparing step. 

9. The method of claim 8, wherein the refining 
step includes (i) stemming, (ii) tokenization or 
(iii) other morphological text processes. 

10. The method of claim 1, further comprising 
providing an identifying feature of the individual 
incoming input document associated with the summary 
of the individual incoming input document, wherein 
the identifying feature includes at least (i) a 
title of the individual incoming input document, 
(ii) a date of creation of the individual incoming 
input document, (iii) author's name of the
individual incoming input document and the like. 

11. The method of claim 1, further comprising 
providing a means for obtaining the individual 
incoming input document after a summary of the 
individual incoming input document is provided.

14. The method of claim 1, further comprising
training on the sample set of documents in order to
induce the set of rules.

15. A method of providing summaries of documents
belonging to a class of a classified document
collection comprising:

inducing a set of rules from a sample set of
documents belonging to one or more classes of a
classified document collection, the induced set of
rules being characteristic of the sample input
documents;

extracting the set of rules in order to provide
a concise description of the one or more classes of
the sample documents;

comparing the extracted set of rules to at
least one individual incoming input document
belonging to a same class as the sample set of
documents; and

providing a summary of each of the at least one
individual incoming input document based on matches
between the set of extracted rules and the
individual incoming input document.

16. A means for providing summaries of documents
belonging to a class of a classified document
collection comprising:
means for inducing a set of rules from a sample
set of documents, the induced set of rules being
characteristic of the sample input documents;
means for comparing extracted words, phrases,
terms and the like appearing in the set of rules
induced from the sample set of documents to at least
one individual incoming input document; and
means for providing a summary of each of the at
least one individual incoming input document based
on matches between a vocabulary of the set of rules
induced from the sample set of documents and the at
least one individual incoming input document.

17. The means of claim 16, wherein the inducing
means further comprises means for extracting the set
of rules in order to provide a concise description
of one or more classes associated with the sample
documents.

18. The means of claim 17, wherein:
the comparing means compares the concise
description of the extracted set of rules to the at
least one individual incoming input document; and
the summary of the individual incoming input
document includes the word or term or phrase matches
between the at least one individual incoming input
document and the concise description of the
extracted set of rules induced from the sample set
of documents.

19. The means of claim 16, further comprising
identifying means for identifying the summary as
belonging to the at least one individual incoming
input document wherein the identifying feature
includes at least (i) a title of the document, (ii)
a date of creation of the document, (iii) author's
name and the like.

20. The means of claim 16, further comprising
means for refining the individual incoming input
document prior to the comparing step, wherein the
refining includes at least (i) stemming, (ii)
tokenization or (iii) morphological text processing
and the like.

21. A computer program product comprising:
a computer usable medium having computer
readable program code embodied in the medium for
query-object synthesis/modification, the computer
program product having:
first computer program code for inducing a set
of rules from a sample set of documents, the sample
set of documents belonging to at least one class of
a classified document collection and the set of
rules being characteristic of the sample input
documents;
second computer program code for extracting the
set of rules in order to provide a concise
description of one or more classes of the sample
documents;
third computer program code for comparing the
extracted set of rules to at least one individual
incoming input document; and
fourth computer program code for providing a
summary of each of the at least one individual
incoming input document based on matches between the
set of extracted rules and the individual incoming
input document.

RULE INDUCTION FOR SUMMARIZING
THE CLASSES IN A CLASSIFIED 
DOCUMENT COLLECTION 



ABSTRACT OF THE DISCLOSURE 



A method and apparatus for providing summaries
of documents belonging to a class of documents in a
classified document collection. A sample set of
documents belonging to one or more classes is
processed via a machine learning system in order to
induce a set of rules associated with the sample set
of documents. The vocabulary in the rules is
extracted and compared to words, terms or phrases of
an incoming document. Any matches between the
extracted rules and the words, terms or phrases of
the incoming document are used as a summary for the
incoming document. With this method and
apparatus, each document need not be processed
individually to find its most important words and the
like in order to produce a summary, with the same
process then repeated for each additional
document.

Application for United States Patent
Declaration and Power of Attorney 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name; 

I believe I am an original, first and joint inventor of the subject matter which is claimed and for which a patent is sought
on the invention entitled RULE INDUCTION FOR SUMMARIZING THE CLASSES IN A CLASSIFIED DOCUMENT 
COLLECTION the specification of which: 

(check one)

☒ is attached hereto

□ was filed on ________ as Application Serial No. ________

and was amended on ________ (if applicable)

I hereby state that I have reviewed and understand the contents of the above identified specification, including the claims, 
as amended by any amendment referred to above. 

I acknowledge the duty to disclose information which is material to the examination of this application in accordance with 
Title 37, Code of Federal Regulations, § 1.56(a).* 

□ I hereby claim foreign priority benefits under Title 35, United States Code, § 119 of any foreign application(s) for patent or
inventor's certificate listed below and have also identified below any foreign application for patent or inventor's certificate having a
filing date before that of the application on which priority is claimed;

Prior Foreign Application(s)                                   Priority Claimed

(Number) (Country) (Day/Month/Year Filed)                      yes   no

(Number) (Country) (Day/Month/Year Filed)                      yes   no

□ I hereby claim the benefit under Title 35, United States Code, § 120 of any United States application(s) listed below and,
insofar as the subject matter of each of the claims of this application is not disclosed in the prior United States application in the
manner provided by the first paragraph of Title 35, United States Code, § 112, I acknowledge the duty to disclose material
information as defined in Title 37, Code of Federal Regulations, § 1.56(a) which occurred between the filing date of the prior
application and the national or PCT international filing date of this application:

(Application Serial No.) (Filing Date) (Status: patented, pending, abandoned)

Power of Attorney: As a named inventor, I hereby appoint Manny W. Schecter, Reg. No. 31,722, Terry J. Ilardi, Reg.
No. 29,936, Stephen C. Kaufman, Reg. No. 29,551, Louis J. Percello, Reg. No. 33,206, Jay P. Sbrollini, Reg. No. 36,266,
Robert M. Trepp, Reg. No. 25,933, Daniel P. Morris, Reg. No. 32,053, Kevin P. Jordan, Reg. No. 40,277, Douglas W.
Cameron, Reg. No. 31,596, David M. Shofi, Reg. No. 39,835, Christopher A. Hughes, Reg. No. 26,914, Edward A. Pennington,
Reg. No. 32,588, John E. Hoel, Reg. No. 26,279, C. Lamont Whitham, Reg. No. 22,424, Marshall M. Curtis, Reg. No. 33,138,
and Michael E. Whitham, Reg. No. 32,635, as attorneys and/or agents to prosecute this application and transact all business in the
Patent and Trademark Office connected therewith. All correspondence should be directed to Whitham, Curtis & Whitham, Reston
International Center, 11800 Sunrise Valley Drive, Suite 900, Reston, Virginia 20191. Phone calls should be directed to
Whitham, Curtis & Whitham, at 703/391-2510.

I hereby declare that all statements made herein of my own knowledge are true and that all statements made on information and
belief are believed to be true; and further that these statements were made with the knowledge that willful false statements and the
like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code and that 
such willful false statements may jeopardize the validity of the application or any patent issued thereon. 

(1) Inventor: David E. Johnson

Signature: [signature] Date: [illegible]

Residence: 187 Frederick Street, Cortlandt Manor, New York 10567 
Citizenship: United States of America 
Post Office Address: same as above 

(2) Inventor: Frederick J. Damerau

Signature: [signature] Date: [illegible]

Residence: 356 Nash Road, North Salem, New York 10560
Citizenship: United States of America
Post Office Address: same as above

*Title 37, Code of Federal Regulations, § 1.56(a):

(a) A duty of candor and good faith toward the Patent and Trademark Office rests on the inventor, on each attorney or agent who
prepares or prosecutes the application and on every other individual who is substantively involved in the preparation or prosecution
of the application and who is associated with the inventor, with the assignee or with anyone to whom there is an obligation to
assign the application. All such individuals have a duty to disclose to the Office information they are aware of which is material to
the examination of the application. Such information is material where there is substantial likelihood that a reasonable examiner
would consider it important in deciding whether to allow the application to issue as a patent. The duty is commensurate with the
degree of involvement in the preparation or prosecution of the application.

(b) Under this section, information is material to patentability when it is not cumulative to information already of record or being
made of record in the application, and (1) it establishes, by itself or in combination with other information, a prima facie case of
unpatentability; or (2) it refutes, or is inconsistent with, a position the applicant takes in: (i) opposing an argument of
unpatentability relied on by the Office, or (ii) asserting an argument of patentability.